Full-Text Indexing of BLOBs and XML
SQL Server 2008 can natively index content in columns of the char, nchar, varchar, nvarchar, text, and xml data types. To index binary large objects, you store them in an image or varbinary(max) column and associate with it a second column, the document type column, which contains the extension the document would have if it were stored in the file system. For example, if you store a Word document in the image or varbinary(max) column, the document type column would have the value doc. While indexing the contents of the image or varbinary(max) column, the indexer reads the value of the document type column for that row and launches the IFilter that corresponds to that value. SQL Server 2008 ships with many IFilters. You can tell which document extensions have IFilters by querying sys.fulltext_document_types:
SELECT * FROM sys.fulltext_document_types
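The document type column arrangement described above can be sketched as follows. This is only an illustration: the table dbo.DocumentStore, its columns, the constraint name PK_DocumentStore, and the catalog DocCatalog are hypothetical names that do not come from the original text.

```sql
-- Hypothetical table: Content holds the binary document,
-- FileExtension holds the extension used to select the IFilter.
CREATE TABLE dbo.DocumentStore
(
    DocId         int            NOT NULL
        CONSTRAINT PK_DocumentStore PRIMARY KEY,
    FileExtension nvarchar(8)    NOT NULL,
    Content       varbinary(max) NULL
);
GO

-- Hypothetical catalog to hold the index.
CREATE FULLTEXT CATALOG DocCatalog;
GO

-- TYPE COLUMN names the column that carries the document extension.
CREATE FULLTEXT INDEX ON dbo.DocumentStore
(
    Content TYPE COLUMN FileExtension
)
KEY INDEX PK_DocumentStore
ON DocCatalog;
GO
```

The primary key constraint supplies the unique, single-column, non-nullable index that the KEY INDEX clause requires.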
If you are indexing a document stored in the image or varbinary(max) data type for which the extension is not listed in sys.fulltext_document_types, the indexer is unable to index the document. To enable indexing for unsupported document types, you must do the following:
1. Download the IFilter for that document type and install it on the server running SQL Server.
2. Enable the third-party IFilters to be used in SQL Server FTS. You do this by issuing the following commands:
Exec Sp_fulltext_service 'load_os_resources', 1
GO
Exec Sp_fulltext_service 'verify_signature', 0
GO
LANGUAGE
By default, the content in the columns you are full-text indexing is broken into words by the word breakers according to the language rules for the default full-text language setting of your instance of SQL Server. You can display this setting by issuing the following command:
sp_configure 'default full-text language'
go
name                        minimum  maximum     config_value  run_value
--------------------------  -------  ----------  ------------  ---------
default full-text language  0        2147483647  1033          1033
Note the value for run_value. This is the locale identifier (LCID). To determine which language the LCID corresponds to, you issue the following:
SELECT name FROM sys.fulltext_languages WHERE lcid=1033
go
name
----------------------------------------------------------
English
In this example, 1033 is the value returned for run_value in the sp_configure query. Note that querying sys.fulltext_languages without the WHERE clause returns a list of the language word breakers that ship by default with SQL Server 2008.
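As a hedged sketch of how the default might be inspected and changed, the following lists the available languages and then sets a different default. It assumes that 2057 is the LCID for British English on your instance, so confirm it with the first query before using it.

```sql
-- List every language available to full-text indexing, with its LCID.
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;
GO

-- 'default full-text language' is an advanced option, so advanced
-- options are made visible first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO

-- Assumption: 2057 is the LCID for British English.
EXEC sp_configure 'default full-text language', 2057;
RECONFIGURE;
GO
```

Changing the default affects only full-text indexes created afterward without an explicit LANGUAGE setting; existing indexes keep the language they were created with.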
The preceding execution of sp_configure returned the default full-text value of 1033,
which corresponds to English. Microsoft recognizes two types of English
in all Microsoft search products: English (U.S. English) and British
English (International English). There are very slight differences
between the two word breakers, mainly due to differing suffixes and
spellings (for example, British English recognizes connexion and colour as legitimate spellings).
By default, all columns are
full-text indexed by the word breaker that corresponds to your default
full-text language settings for your instance of SQL Server.
SQL Server FTS allows you to
use the language tag to specify word breakers for different languages
to be used to full-text index columns. For example, if you are storing
Traditional Chinese content in a column you want to index, and you want
it to be indexed using Traditional Chinese, you could issue the
following statement to create a full-text index:
CREATE FULLTEXT INDEX ON Person.Contact(FirstName,
LastName LANGUAGE 1028)
KEY INDEX PK_Contact_ContactID ON MyCatalog
This example full-text indexes two columns; one called FirstName is indexed using the server default full-text language, and the other, called LastName,
is indexed using the Traditional Chinese language word breaker. This
means that what ends up stored in the full-text indexes is broken
according to the language rules of the word breaker. For U.S. and
International English, the words are primarily broken at whitespace or
word boundaries (that is, punctuation marks). For other languages, the
word may be broken into constituent words or alternate words. For
example, if you use the German word breaker, wanderlust is broken as wanderlust, wandern, and lust, and all three words are stored in the index; searches on wanderlust, wandern, and lust all return hits to rows containing wanderlust.
You can specify
different language settings for each column you are full-text indexing,
but you can assign only one language setting for each column.
If you are storing BLOBs in columns of the image or varbinary(max) data type and have a document type column assigned to these columns, then, depending on your content, the language settings within the content itself may override the language setting you specified for your full-text index or your SQL Server default full-text language setting. For example, if you are indexing HTML or Word documents that are marked as Chinese, the content is indexed as Chinese even if you have specified that the documents be indexed in German and your SQL Server default full-text language setting is French. The same holds true for XML documents stored in columns of the xml data type: the xml:lang setting determines the language in which these documents are indexed.
ON FULLTEXT CATALOG
The ON FULLTEXT CATALOG
parameter allows you to place your full-text index in a specific
catalog. If you have a default full-text catalog for the database, you
do not need to specify a catalog. You get better indexing and querying
performance if you place larger tables in their own full-text catalogs.
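The following is a minimal sketch of setting up a default catalog so that the catalog clause can be omitted. ContactCatalog is a hypothetical name, and the last statement assumes Person.Contact does not already have a full-text index, because a table can have only one.

```sql
USE AdventureWorks;
GO
-- Create a catalog and mark it as the database default.
CREATE FULLTEXT CATALOG ContactCatalog AS DEFAULT;
GO
-- Confirm which catalog is the default.
SELECT name, is_default
FROM sys.fulltext_catalogs;
GO
-- With a default catalog in place, no catalog needs to be named.
CREATE FULLTEXT INDEX ON Person.Contact(FirstName)
KEY INDEX PK_Contact_ContactID;
GO
```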
KEY INDEX
SQL Server FTS must be able to
identify the row that it is indexing or that is returned in the query
results. You specify which column is to be used as the key by using the KEY INDEX
parameter in your full-text index creation statement. As mentioned
previously, this column must be unique and non-nullable, and it must
have a single-column index that is not offline and have a maximum size
of 900 bytes. It can be a unique index or your primary key.
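To see which existing indexes might qualify as the key index, a query along these lines can help. It is a sketch against the catalog views that checks for uniqueness, a single non-nullable key column, and a non-disabled index; it does not check whether the index is offline, and max_length is reported in bytes for comparison with the 900-byte limit.

```sql
USE AdventureWorks;
GO
SELECT i.name AS index_name,
       c.name AS column_name,
       c.max_length
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'Person.Contact')
  AND i.is_unique = 1            -- unique index or primary key
  AND i.is_disabled = 0
  AND c.is_nullable = 0          -- key column must not allow NULLs
  AND ic.is_included_column = 0
  AND NOT EXISTS                 -- no second key column in the index
      (SELECT 1
       FROM sys.index_columns AS ic2
       WHERE ic2.object_id = i.object_id
         AND ic2.index_id = i.index_id
         AND ic2.is_included_column = 0
         AND ic2.column_id <> ic.column_id);
GO
```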
POPULATION TYPE
The process in which the indexer extracts your table content and builds a full-text index is called population. There are three types of populations:
Full
Incremental
Change tracking
No matter what population type you choose, a full population is done first. The full population extracts rows in batches and indexes them. It does not do any change tracking, so your catalog starts to become out-of-date as soon as the population completes.
An incremental population can be used only if there is a timestamp column on the table you are full-text indexing. The incremental population extracts each row to determine which rows have been updated and re-indexes only the changed rows. It also determines which rows have been removed from the table you are full-text indexing. A row is flagged to be re-indexed if any of its columns are updated, so if you update a column that is not being full-text indexed, the row is still indexed again.
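As a small sketch of preparing a table for incremental populations, the following checks for a timestamp column and adds one if needed. RowStamp is a hypothetical column name; a table can have only one timestamp column, so the ALTER TABLE statement should be run only if the check returns 0.

```sql
USE AdventureWorks;
GO
-- Returns 1 if the table already has a timestamp column, 0 otherwise.
SELECT OBJECTPROPERTY(OBJECT_ID(N'Person.Contact'), 'TableHasTimestamp')
    AS has_timestamp;
GO
-- Hypothetical column name; run only if the check above returned 0.
ALTER TABLE Person.Contact ADD RowStamp timestamp;
GO
```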
You should use incremental populations rather than full populations when a significant amount of your table's contents changes at any one time. If the bulk of your table changes (around 90%), a full population is faster than an incremental population.
You use the following command to create a full-text index without starting a population:
Use AdventureWorks;
CREATE FULLTEXT INDEX ON Person.Contact(Firstname)
KEY INDEX pk_Contact_ContactID WITH CHANGE_TRACKING OFF, NO POPULATION
To then start a full or incremental population, you issue the following statements, respectively:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact START FULL POPULATION -- full population
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact START INCREMENTAL POPULATION -- incremental population
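After starting a population, you may want to check whether it is still running. The following sketch uses the TableFulltextPopulateStatus property and the sys.fulltext_indexes catalog view. Only the status values I am confident of are listed in the comment; consult the product documentation for the full list.

```sql
USE AdventureWorks;
GO
-- 0 = idle, 1 = full population in progress,
-- 2 = incremental population in progress (other values exist).
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'Person.Contact'),
                        'TableFulltextPopulateStatus') AS populate_status;
GO
-- Details of the most recent crawl for the table's full-text index.
SELECT crawl_type_desc,
       crawl_start_date,
       crawl_end_date,
       has_crawl_completed
FROM sys.fulltext_indexes
WHERE object_id = OBJECT_ID(N'Person.Contact');
GO
```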
At all other times, you should use change tracking because it is much more efficient and offers near-real-time indexing. Change tracking indexes, in near-real-time, the rows in which the columns you are full-text indexing have been modified. Change tracking starts by doing a full population but does an incremental population if a timestamp column exists on the table. Change tracking (like other population types) causes some locking on the tables you are full-text indexing, so you have the option to schedule when the indexing of the modified rows is done.
By default, when you create a new
full-text index, change tracking is enabled. In other words, a full
population is done and when it completes, all rows modified during the
full population and after it completes are indexed. So the following
statements are equivalent:
Use AdventureWorks;
CREATE FULLTEXT INDEX ON Person.Contact(Firstname)
KEY INDEX pk_Contact_ContactID WITH CHANGE_TRACKING AUTO
Use AdventureWorks;
CREATE FULLTEXT INDEX ON Person.Contact(Firstname)
KEY INDEX pk_Contact_ContactID
Because
change tracking causes some locking, you can schedule rows to be
tracked in real-time but indexed only at scheduled intervals by using
the following statement:
Use AdventureWorks;
CREATE FULLTEXT INDEX ON Person.Contact(Firstname)
KEY INDEX pk_Contact_ContactID WITH CHANGE_TRACKING MANUAL
The preceding command assumes a default catalog. If you do not have a default catalog, you would have to specify a named one like this:
Use AdventureWorks;
CREATE FULLTEXT INDEX ON Person.Contact(Firstname)
KEY INDEX pk_Contact_ContactID ON DEFAULT_FULLTEXT_CATALOG WITH
CHANGE_TRACKING MANUAL
To update your index, you issue the following:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact START UPDATE POPULATION
ALTER FULLTEXT INDEX
As you have seen in this article, you can use the ALTER FULLTEXT INDEX command to manage populations. You can also use it for a wide variety of index maintenance tasks. Its parameters are discussed in the following sections.
ENABLE and DISABLE
The ENABLE and DISABLE parameters enable and disable full-text indexing on a table. While the index is disabled, its metadata is retained, but the catalog is no longer kept up-to-date, and according to Microsoft's documentation the table does not support full-text queries until the index is re-enabled.
For example, you could disable indexing with the following command:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact DISABLE
And then you could re-enable indexing with the following:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact ENABLE
When you re-enable a
full-text index, change tracking commences to update changes that
occurred while full-text indexing was disabled. If you disabled change
tracking prior to disabling the full-text index, you have to run a full
or incremental population to get your catalog up-to-date.
SET CHANGE_TRACKING
The SET CHANGE_TRACKING
option allows you to control change tracking. For example, you can turn
it off, turn it on, or schedule it. Because change tracking does cause
some locking, you might want to schedule it during a quiet time when the
database is not under load to minimize the impact of the locking.
Here is an example of the use of SET CHANGE_TRACKING:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact SET CHANGE_TRACKING AUTO
The options for setting change tracking are as follows:
AUTO— Enables continuous real-time indexing.
OFF— Disables change tracking.
MANUAL— Provides continuous change tracking, but rows are indexed only when you issue the following command:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact Start Update Population
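To confirm which of these options is currently in effect, you can query the sys.fulltext_indexes catalog view. This is a minimal sketch; the description column is expected to read AUTO, MANUAL, or OFF.

```sql
USE AdventureWorks;
GO
SELECT OBJECT_NAME(object_id) AS table_name,
       is_enabled,
       change_tracking_state_desc
FROM sys.fulltext_indexes
WHERE object_id = OBJECT_ID(N'Person.Contact');
GO
```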
ADD
You use the ADD parameter to add a new column to a full-text index. For example, consider Person.Contact, a table in the AdventureWorks database, with three char columns on it: Firstname, Lastname, and EmailAddress. You have already created a full-text index on Firstname and Lastname. You could add full-text indexing to EmailAddress by issuing the following command:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact ADD(EmailAddress)
As soon as you add the
new column, a full population is done to index the contents of the newly
added column. You can disable it with the WITH NO POPULATION clause, as in this example:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact ADD(EmailAddress) WITH NO POPULATION
You may get the following message:
Msg 7663, Level 16, State 2, Line 2
Option 'WITH NO POPULATION' should not be used when change tracking is enabled.
This message indicates that change tracking is on. To prevent a population from starting immediately after you add the column, you first have to disable change tracking and then make your change, as illustrated in the following example:
ALTER FULLTEXT INDEX ON Person.Contact
SET CHANGE_TRACKING OFF
ALTER FULLTEXT INDEX ON Person.Contact ADD(EmailAddress)
WITH NO POPULATION
You also have the option to specify a word breaker (language) to be used for the column you add or, if that column is an image or varbinary(max) column, a document type column for it to reference.
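Both options can be sketched as follows. The first statement uses the Person.Contact table from the text. The second assumes a hypothetical table dbo.DocumentStore that already has a full-text index, a varbinary(max) column named Attachment, and an extension column named FileExtension; none of these names come from the original.

```sql
USE AdventureWorks;
GO
-- Add a column and index it with the U.S. English word breaker (LCID 1033).
ALTER FULLTEXT INDEX ON Person.Contact
    ADD (EmailAddress LANGUAGE 1033);
GO
-- Hypothetical table and columns: TYPE COLUMN names the column that
-- holds the document extension for the binary column being added.
ALTER FULLTEXT INDEX ON dbo.DocumentStore
    ADD (Attachment TYPE COLUMN FileExtension LANGUAGE 1033);
GO
```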
DROP
The DROP parameter is the counterpart of the ADD parameter: it removes a column from a full-text index. Like ADD, it supports the WITH NO POPULATION clause, which disables automatic re-indexing after you drop the full-text column. Here is an example of its use:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact DROP (Firstname) WITH NO POPULATION
Again, you may get the following message:
Msg 7663, Level 16, State 2, Line 2
Option 'WITH NO POPULATION' should not be used when change tracking is enabled.
This message indicates that change tracking is on. To prevent a population from starting immediately after you drop the column, you first have to disable change tracking and then make your change, as illustrated in the following example:
ALTER FULLTEXT INDEX ON Person.Contact
SET CHANGE_TRACKING OFF
ALTER FULLTEXT INDEX ON Person.Contact DROP(EmailAddress)
WITH NO POPULATION
The DROP command can be used to drop all the full-text columns on a table.
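Before dropping columns, it can be useful to see which columns are currently full-text indexed. The following sketch joins sys.fulltext_index_columns to sys.columns for the Person.Contact table used in the examples.

```sql
USE AdventureWorks;
GO
-- List the full-text indexed columns and the LCID each one uses.
SELECT c.name AS column_name,
       fic.language_id
FROM sys.fulltext_index_columns AS fic
JOIN sys.columns AS c
    ON c.object_id = fic.object_id AND c.column_id = fic.column_id
WHERE fic.object_id = OBJECT_ID(N'Person.Contact');
GO
```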
START and STOP
The START and STOP parameters can be used to start and stop full, incremental, or update populations. Following is the typical syntax:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact Stop Population
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact Start Full Population
The update population is used in conjunction with change tracking. For example, if you set up change tracking in manual mode, you start an update population to index the tracked changes, like this:
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact SET CHANGE_TRACKING Manual
Use AdventureWorks;
ALTER FULLTEXT INDEX ON Person.Contact START Update Population
We’ve completed our
look at the catalog and index creation statements. Next, we look at how
to manage full-text catalogs and indexes.
Managing MSFTESQL
After you create full-text catalogs and indexes, you might need to manage the full-text engine. The command used to do this is sp_fulltext_service, which accepts two parameters: @action, which names the setting to change, and @value, which supplies the new setting.
Following are the acceptable values for the @action parameter:
load_os_resources—
Controls whether the full-text engine loads word breakers and IFilters
that are not part of SQL Server but are installed in the OS. A value of 1 loads the OS word breakers and IFilters.
pause_indexing— Pauses the indexing process. During this pause, you can still query the full-text catalogs.
resource_usage— Is used for backward compatibility.
update_languages— Updates the language cache with recently installed word breakers.
verify_signature— Disables the checking of signatures for word breakers and IFilters when set to 0. When set to the default, 1, signatures are checked.
upgrade_option—
Controls how SQL Server processes full-text catalogs in databases that are restored or attached to SQL Server 2008. It accepts three values: 0, which forces the full-text catalogs in attached or restored databases to be rebuilt; 1, which means the full-text catalogs' metadata remains, but the catalog contents are deleted (these catalogs are queryable, but no results are returned until you rebuild them); and 2, which means the full-text indexes are imported into the database (however, the results may be inconsistent because some of the full-text indexes are generated by the SQL Server 2005 full-text word breakers and not the SQL Server 2008 word breakers).
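A few of these actions can be sketched as follows. The values shown are illustrative choices rather than recommendations. For pause_indexing, 1 pauses indexing and 0 resumes it.

```sql
-- Pause indexing, then resume it.
EXEC sp_fulltext_service @action = 'pause_indexing', @value = 1;
EXEC sp_fulltext_service @action = 'pause_indexing', @value = 0;
GO
-- Refresh the language cache after installing a new word breaker.
EXEC sp_fulltext_service @action = 'update_languages';
GO
-- Import existing full-text indexes when databases are attached or restored.
EXEC sp_fulltext_service @action = 'upgrade_option', @value = 2;
GO
```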
Now you know how to build
full-text catalogs and indexes and modify them. The next section
describes how to get information on the catalogs and indexes you build.